Data Analysis & Visualization - Michael Cheng
Project Problem Statement - Auto-mpg Analysis
Background
Context
Shifting market conditions, globalization, cost pressure, and volatility are changing the automobile market landscape. The emergence of data, in conjunction with machine learning, has paved the way for operational and business transformation in automobile companies.
The automobile market is vast and diverse, with numerous vehicle categories being manufactured and sold with varying configurations of attributes such as displacement, horsepower, and acceleration. We aim to find combinations of these features that can clearly distinguish certain groups of automobiles from others through this analysis, as this will inform other downstream processes for any organization aiming to sell each group of vehicles to a slightly different target audience.
You are a Data Scientist at SecondLife which is a leading used car dealership with numerous outlets across the US. Recently, they have started shifting their focus to vintage cars and have been diligently collecting data about all the vintage cars they have sold over the years. The Director of Operations at SecondLife wants to leverage the data to extract insights about the cars and find different groups of vintage cars to target the audience more efficiently. [1]
Objective
The objective of this problem is to explore the data, extract meaningful insights, and find different groups of vehicles in the data by using dimensionality reduction techniques like PCA and t-SNE.
Data Description: There are 8 variables in the dataset:
mpg: miles per gallon
cyl: number of cylinders
disp: engine displacement (cu. inches) or engine size
hp: horsepower
wt: vehicle weight (lbs.)
acc: time taken to accelerate from 0 to 60 mph (sec.)
yr: model year
car name: car model name
Import Libraries & Load the Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Importing PCA and t-SNE
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# Summary Tools
from summarytools import dfSummary
data = pd.read_csv(r"C:\Users\mikecbos\Downloads\MIT_Elective-AutoMPG\auto-mpg.csv")
Data Preprocessing
# Copy of data
df = data.copy()
# Overview of data
print(df.head())
df.info()
dfSummary(df)
mpg cylinders displacement horsepower weight acceleration model year \
0 18.0 8 307.0 130 3504 12.0 70
1 15.0 8 350.0 165 3693 11.5 70
2 18.0 8 318.0 150 3436 11.0 70
3 16.0 8 304.0 150 3433 12.0 70
4 17.0 8 302.0 140 3449 10.5 70
car name
0 chevrolet chevelle malibu
1 buick skylark 320
2 plymouth satellite
3 amc rebel sst
4 ford torino
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 mpg 398 non-null float64
1 cylinders 398 non-null int64
2 displacement 398 non-null float64
3 horsepower 398 non-null object
4 weight 398 non-null int64
5 acceleration 398 non-null float64
6 model year 398 non-null int64
7 car name 398 non-null object
dtypes: float64(3), int64(3), object(2)
memory usage: 25.0+ KB
| No | Variable | Stats / Values | Freqs / (% of Valid) | Missing |
|---|---|---|---|---|
| 1 | mpg [float64] | Mean (sd): 23.5 (7.8); min < med < max: 9.0 < 23.0 < 46.6; IQR (CV): 11.5 (3.0) | 129 distinct values | 0 (0.0%) |
| 2 | cylinders [int64] | Mean (sd): 5.5 (1.7); min < med < max: 3.0 < 4.0 < 8.0; IQR (CV): 4.0 (3.2) | 5 distinct values | 0 (0.0%) |
| 3 | displacement [float64] | Mean (sd): 193.4 (104.3); min < med < max: 68.0 < 148.5 < 455.0; IQR (CV): 157.8 (1.9) | 82 distinct values | 0 (0.0%) |
| 4 | horsepower [object] | Top values: 150, 90, 88, 110, 100, 75, 95, 105, 70, 67, other | 22 (5.5%), 20 (5.0%), 19 (4.8%), 18 (4.5%), 17 (4.3%), 14 (3.5%), 14 (3.5%), 12 (3.0%), 12 (3.0%), 12 (3.0%), 238 (59.8%) | 0 (0.0%) |
| 5 | weight [int64] | Mean (sd): 2970.4 (846.8); min < med < max: 1613.0 < 2803.5 < 5140.0; IQR (CV): 1384.2 (3.5) | 351 distinct values | 0 (0.0%) |
| 6 | acceleration [float64] | Mean (sd): 15.6 (2.8); min < med < max: 8.0 < 15.5 < 24.8; IQR (CV): 3.3 (5.6) | 95 distinct values | 0 (0.0%) |
| 7 | model year [int64] | Mean (sd): 76.0 (3.7); min < med < max: 70.0 < 76.0 < 82.0; IQR (CV): 6.0 (20.6) | 13 distinct values | 0 (0.0%) |
| 8 | car name [object] | Top values: ford pinto, toyota corolla, amc matador, ford maverick, chevrolet chevette, amc gremlin, chevrolet impala, peugeot 504, amc hornet, toyota corona, other | 6 (1.5%), 5 (1.3%), 5 (1.3%), 5 (1.3%), 4 (1.0%), 4 (1.0%), 4 (1.0%), 4 (1.0%), 4 (1.0%), 4 (1.0%), 353 (88.7%) | 0 (0.0%) |
The overview above shows:
- There are no duplicate rows or explicitly missing values, though 'horsepower' is typed as object, which hints at non-numeric entries.
- The average model year is 1976, with a relatively low spread (3.7 years).
- The minimum year is 1970, the median is 1976, and the maximum is 1982, showing a symmetric spread around the median.
- The Inter-Quartile Range (IQR) of 6.0 indicates that the middle 50% of model years span 6 years (roughly 1973 to 1979, depending on the exact quartiles).
- The reported Coefficient of Variation (CV) of 20.6 follows the summarytools convention of mean divided by standard deviation; equivalently, the standard deviation is only about 5% of the mean, indicating low relative variability in model years.
- The dataset captures the following vehicle performance features:
- mpg
- cylinders
- displacement
- horsepower
- weight
- acceleration
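The IQR and CV arithmetic above can be sketched as follows. Note that summarytools reports its "CV" column as mean divided by standard deviation, the inverse of the more common sd/mean convention. The series below is a small synthetic stand-in, not the actual dataset.

```python
import pandas as pd

# Illustrative sketch of the IQR and CV figures quoted above, on a small
# synthetic 'model year' series (a stand-in, not the actual dataset).
years = pd.Series([70, 71, 73, 74, 76, 76, 78, 79, 81, 82])

q1, q3 = years.quantile(0.25), years.quantile(0.75)
iqr = q3 - q1                                   # spread of the middle 50% of values
cv_sd_over_mean = years.std() / years.mean()    # conventional CV
cv_mean_over_sd = years.mean() / years.std()    # summarytools' "CV" column

print(f"IQR: {iqr:.2f}")
print(f"sd/mean: {cv_sd_over_mean:.1%}, mean/sd: {cv_mean_over_sd:.1f}")
```

Running the same calculation on the real 'model year' column reproduces the 6.0 (20.6) pair shown in the summary table.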
# Objects Columns: Review 'car name'
df['car name'].unique()
array(['chevrolet chevelle malibu', 'buick skylark 320',
'plymouth satellite', 'amc rebel sst', 'ford torino',
'ford galaxie 500', 'chevrolet impala', 'plymouth fury iii',
'pontiac catalina', 'amc ambassador dpl', 'dodge challenger se',
"plymouth 'cuda 340", 'chevrolet monte carlo',
'buick estate wagon (sw)', 'toyota corona mark ii',
'plymouth duster', 'amc hornet', 'ford maverick', 'datsun pl510',
'volkswagen 1131 deluxe sedan', 'peugeot 504', 'audi 100 ls',
'saab 99e', 'bmw 2002', 'amc gremlin', 'ford f250', 'chevy c20',
'dodge d200', 'hi 1200d', 'chevrolet vega 2300', 'toyota corona',
'ford pinto', 'plymouth satellite custom', 'ford torino 500',
'amc matador', 'pontiac catalina brougham', 'dodge monaco (sw)',
'ford country squire (sw)', 'pontiac safari (sw)',
'amc hornet sportabout (sw)', 'chevrolet vega (sw)',
'pontiac firebird', 'ford mustang', 'mercury capri 2000',
'opel 1900', 'peugeot 304', 'fiat 124b', 'toyota corolla 1200',
'datsun 1200', 'volkswagen model 111', 'plymouth cricket',
'toyota corona hardtop', 'dodge colt hardtop', 'volkswagen type 3',
'chevrolet vega', 'ford pinto runabout', 'amc ambassador sst',
'mercury marquis', 'buick lesabre custom',
'oldsmobile delta 88 royale', 'chrysler newport royal',
'mazda rx2 coupe', 'amc matador (sw)',
'chevrolet chevelle concours (sw)', 'ford gran torino (sw)',
'plymouth satellite custom (sw)', 'volvo 145e (sw)',
'volkswagen 411 (sw)', 'peugeot 504 (sw)', 'renault 12 (sw)',
'ford pinto (sw)', 'datsun 510 (sw)',
'toyouta corona mark ii (sw)', 'dodge colt (sw)',
'toyota corolla 1600 (sw)', 'buick century 350',
'chevrolet malibu', 'ford gran torino', 'dodge coronet custom',
'mercury marquis brougham', 'chevrolet caprice classic',
'ford ltd', 'plymouth fury gran sedan',
'chrysler new yorker brougham', 'buick electra 225 custom',
'amc ambassador brougham', 'plymouth valiant',
'chevrolet nova custom', 'volkswagen super beetle', 'ford country',
'plymouth custom suburb', 'oldsmobile vista cruiser',
'toyota carina', 'datsun 610', 'maxda rx3', 'mercury capri v6',
'fiat 124 sport coupe', 'chevrolet monte carlo s',
'pontiac grand prix', 'fiat 128', 'opel manta', 'audi 100ls',
'volvo 144ea', 'dodge dart custom', 'saab 99le', 'toyota mark ii',
'oldsmobile omega', 'chevrolet nova', 'datsun b210',
'chevrolet chevelle malibu classic', 'plymouth satellite sebring',
'buick century luxus (sw)', 'dodge coronet custom (sw)',
'audi fox', 'volkswagen dasher', 'datsun 710', 'dodge colt',
'fiat 124 tc', 'honda civic', 'subaru', 'fiat x1.9',
'plymouth valiant custom', 'mercury monarch', 'chevrolet bel air',
'plymouth grand fury', 'buick century',
'chevroelt chevelle malibu', 'plymouth fury', 'buick skyhawk',
'chevrolet monza 2+2', 'ford mustang ii', 'toyota corolla',
'pontiac astro', 'volkswagen rabbit', 'amc pacer', 'volvo 244dl',
'honda civic cvcc', 'fiat 131', 'capri ii', 'renault 12tl',
'dodge coronet brougham', 'chevrolet chevette', 'chevrolet woody',
'vw rabbit', 'dodge aspen se', 'ford granada ghia',
'pontiac ventura sj', 'amc pacer d/l', 'datsun b-210', 'volvo 245',
'plymouth volare premier v8', 'mercedes-benz 280s',
'cadillac seville', 'chevy c10', 'ford f108', 'dodge d100',
'honda accord cvcc', 'buick opel isuzu deluxe', 'renault 5 gtl',
'plymouth arrow gs', 'datsun f-10 hatchback',
'oldsmobile cutlass supreme', 'dodge monaco brougham',
'mercury cougar brougham', 'chevrolet concours', 'buick skylark',
'plymouth volare custom', 'ford granada', 'pontiac grand prix lj',
'chevrolet monte carlo landau', 'chrysler cordoba',
'ford thunderbird', 'volkswagen rabbit custom',
'pontiac sunbird coupe', 'toyota corolla liftback',
'ford mustang ii 2+2', 'dodge colt m/m', 'subaru dl', 'datsun 810',
'bmw 320i', 'mazda rx-4', 'volkswagen rabbit custom diesel',
'ford fiesta', 'mazda glc deluxe', 'datsun b210 gx',
'oldsmobile cutlass salon brougham', 'dodge diplomat',
'mercury monarch ghia', 'pontiac phoenix lj',
'ford fairmont (auto)', 'ford fairmont (man)', 'plymouth volare',
'amc concord', 'buick century special', 'mercury zephyr',
'dodge aspen', 'amc concord d/l',
'buick regal sport coupe (turbo)', 'ford futura',
'dodge magnum xe', 'datsun 510', 'dodge omni',
'toyota celica gt liftback', 'plymouth sapporo',
'oldsmobile starfire sx', 'datsun 200-sx', 'audi 5000',
'volvo 264gl', 'saab 99gle', 'peugeot 604sl',
'volkswagen scirocco', 'honda accord lx', 'pontiac lemans v6',
'mercury zephyr 6', 'ford fairmont 4', 'amc concord dl 6',
'dodge aspen 6', 'ford ltd landau', 'mercury grand marquis',
'dodge st. regis', 'chevrolet malibu classic (sw)',
'chrysler lebaron town @ country (sw)', 'vw rabbit custom',
'maxda glc deluxe', 'dodge colt hatchback custom', 'amc spirit dl',
'mercedes benz 300d', 'cadillac eldorado', 'plymouth horizon',
'plymouth horizon tc3', 'datsun 210', 'fiat strada custom',
'buick skylark limited', 'chevrolet citation',
'oldsmobile omega brougham', 'pontiac phoenix',
'toyota corolla tercel', 'datsun 310', 'ford fairmont',
'audi 4000', 'toyota corona liftback', 'mazda 626',
'datsun 510 hatchback', 'mazda glc', 'vw rabbit c (diesel)',
'vw dasher (diesel)', 'audi 5000s (diesel)', 'mercedes-benz 240d',
'honda civic 1500 gl', 'renault lecar deluxe', 'vokswagen rabbit',
'datsun 280-zx', 'mazda rx-7 gs', 'triumph tr7 coupe',
'ford mustang cobra', 'honda accord', 'plymouth reliant',
'dodge aries wagon (sw)', 'toyota starlet', 'plymouth champ',
'honda civic 1300', 'datsun 210 mpg', 'toyota tercel',
'mazda glc 4', 'plymouth horizon 4', 'ford escort 4w',
'ford escort 2h', 'volkswagen jetta', 'renault 18i',
'honda prelude', 'datsun 200sx', 'peugeot 505s turbo diesel',
'volvo diesel', 'toyota cressida', 'datsun 810 maxima',
'oldsmobile cutlass ls', 'ford granada gl',
'chrysler lebaron salon', 'chevrolet cavalier',
'chevrolet cavalier wagon', 'chevrolet cavalier 2-door',
'pontiac j2000 se hatchback', 'dodge aries se',
'ford fairmont futura', 'amc concord dl', 'volkswagen rabbit l',
'mazda glc custom l', 'mazda glc custom', 'plymouth horizon miser',
'mercury lynx l', 'nissan stanza xe', 'honda civic (auto)',
'datsun 310 gx', 'buick century limited',
'oldsmobile cutlass ciera (diesel)', 'chrysler lebaron medallion',
'ford granada l', 'toyota celica gt', 'dodge charger 2.2',
'chevrolet camaro', 'ford mustang gl', 'vw pickup',
'dodge rampage', 'ford ranger', 'chevy s-10'], dtype=object)
Decision Point:
- Because the firm has only "recently started" [1] shifting its focus to vintage vehicles, this dataset likely contains both vintage and non-vintage vehicles.
- Although "vintage" may formally be designated by a particular year rather than a vehicle's particular make/model, cultural perception also plays a significant role in a vehicle's vintage value.
- Since the firm wants to "leverage the data to extract insights about the cars and find different groups of vintage cars to target the audience more efficiently" [1], 'car name' will be considered along with the other features.
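As a complement to the name-based view, the year-based notion of "vintage" in the first bullet can be sketched as a simple age flag. The 25-year cutoff and the reference year below are illustrative assumptions, not part of the dataset or the firm's definition.

```python
import pandas as pd

# Hypothetical year-based vintage flag: a vehicle counts as "vintage" if it
# is at least CUTOFF years old relative to an assumed reference year.
# Both constants are illustrative assumptions; the data is a tiny sample.
sample = pd.DataFrame({"model year": [70, 76, 82]})   # two-digit years, 19xx
REFERENCE_YEAR = 2000
CUTOFF = 25

sample["age"] = REFERENCE_YEAR - (1900 + sample["model year"])
sample["vintage_by_year"] = sample["age"] >= CUTOFF
print(sample)
```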
# Objects Columns: Review 'horsepower'
df['horsepower'].unique()
array(['130', '165', '150', '140', '198', '220', '215', '225', '190',
'170', '160', '95', '97', '85', '88', '46', '87', '90', '113',
'200', '210', '193', '?', '100', '105', '175', '153', '180', '110',
'72', '86', '70', '76', '65', '69', '60', '80', '54', '208', '155',
'112', '92', '145', '137', '158', '167', '94', '107', '230', '49',
'75', '91', '122', '67', '83', '78', '52', '61', '93', '148',
'129', '96', '71', '98', '115', '53', '81', '79', '120', '152',
'102', '108', '68', '58', '149', '89', '63', '48', '66', '139',
'103', '125', '133', '138', '135', '142', '77', '62', '132', '84',
'64', '74', '116', '82'], dtype=object)
# Undefined value "?" occurrence
print("Instances of ? in 'horsepower'")
df['horsepower'].value_counts()['?']
Instances of ? in 'horsepower'
6
Decision Point:
- "Horsepower" should be converted to a numeric data type for meaningful analysis and visualization
- "?" values determination: Use Regression (generalization based on global patterns) rather than KNN accounting for variabilities based on local patterns
# Use Regression to predict "?" horsepower values
from sklearn.linear_model import LinearRegression
# Replace '?' with NaN and convert to numeric
df['horsepower'] = pd.to_numeric(df['horsepower'].replace('?', np.nan), errors='coerce')
# Split rows with and without missing horsepower
df_missing_hp = df[df['horsepower'].isna()]
df_non_missing_hp = df[~df['horsepower'].isna()]
# Features and target for non-missing rows
X = df_non_missing_hp[['mpg', 'cylinders', 'displacement', 'weight', 'acceleration', 'model year']]
y = df_non_missing_hp['horsepower']
# Train a regression model
model = LinearRegression()
model.fit(X, y)
# Predict missing horsepower
X_missing = df_missing_hp[['mpg', 'cylinders', 'displacement', 'weight', 'acceleration', 'model year']]
df.loc[df['horsepower'].isna(), 'horsepower'] = model.predict(X_missing)
# Display updated DataFrame
df
| mpg | cylinders | displacement | horsepower | weight | acceleration | model year | car name | |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | ford torino |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 393 | 27.0 | 4 | 140.0 | 86.0 | 2790 | 15.6 | 82 | ford mustang gl |
| 394 | 44.0 | 4 | 97.0 | 52.0 | 2130 | 24.6 | 82 | vw pickup |
| 395 | 32.0 | 4 | 135.0 | 84.0 | 2295 | 11.6 | 82 | dodge rampage |
| 396 | 28.0 | 4 | 120.0 | 79.0 | 2625 | 18.6 | 82 | ford ranger |
| 397 | 31.0 | 4 | 119.0 | 82.0 | 2720 | 19.4 | 82 | chevy s-10 |
398 rows × 8 columns
# Check for null values
df['horsepower'].isna().sum()
0
# Review unique values
df['horsepower'].unique()
array([130. , 165. , 150. , 140. ,
198. , 220. , 215. , 225. ,
190. , 170. , 160. , 95. ,
97. , 85. , 88. , 46. ,
87. , 90. , 113. , 200. ,
210. , 193. , 63.91642352, 100. ,
105. , 175. , 153. , 180. ,
110. , 72. , 86. , 70. ,
76. , 65. , 69. , 60. ,
80. , 54. , 208. , 155. ,
112. , 92. , 145. , 137. ,
158. , 167. , 94. , 107. ,
230. , 49. , 75. , 91. ,
122. , 96.55621422, 67. , 83. ,
78. , 52. , 61. , 93. ,
148. , 129. , 96. , 71. ,
98. , 115. , 53. , 81. ,
79. , 120. , 152. , 102. ,
108. , 68. , 58. , 149. ,
89. , 63. , 48. , 66. ,
139. , 103. , 125. , 133. ,
138. , 135. , 142. , 77. ,
57.85814468, 62. , 132. , 102.00825115,
84. , 64. , 74. , 76.40523288,
116. , 75.53953501, 82. ])
Decision Point
A few imputed horsepower values carry fractional parts that can be distracting for display purposes. This feature, however, will require domain experts to validate. Moreover, vintage classification is not primarily determined by horsepower or any other performance feature (see above). It therefore seems better to leave these records as flagged entries, identifiable by their fractional values, and allow domain experts to review them.
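One lightweight way to keep those records identifiable is to mark any horsepower value with a fractional part; since the original entries were integer-valued, after imputation only the regression-filled rows are non-integer. The sketch below uses a synthetic frame, and the column name `hp_imputed` is an assumption.

```python
import pandas as pd

# Flag rows whose horsepower has a fractional part; after the regression
# imputation above, only imputed values are non-integer. Synthetic sample.
sample = pd.DataFrame({"horsepower": [130.0, 96.55621422, 88.0, 63.91642352]})
sample["hp_imputed"] = sample["horsepower"] != sample["horsepower"].round()
print(sample)
```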
# Create an index column for visualization traceability
df['index'] = df.index
# Review preprocessed data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 mpg 398 non-null float64
1 cylinders 398 non-null int64
2 displacement 398 non-null float64
3 horsepower 398 non-null float64
4 weight 398 non-null int64
5 acceleration 398 non-null float64
6 model year 398 non-null int64
7 car name 398 non-null object
8 index 398 non-null int64
dtypes: float64(4), int64(4), object(1)
memory usage: 28.1+ KB
Observations:
- Make and model, as discussed above, can be significant in determining a vehicle's vintage classification
- Most (88.7%) of the 'car name' values are rare, grouped as "Other" in the frequency distribution [2], which can be the very reason a vehicle is classified as "vintage"
- Therefore, these 'car name' labels should be further reviewed, and disaggregated if possible
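The "Other" share quoted above can be reproduced from value counts. The sketch below uses a tiny synthetic sample of car names rather than the real column.

```python
import pandas as pd

# Share of distinct car names that occur exactly once (the "Other" bucket
# in the frequency summary). Synthetic sample, not the real column.
names = pd.Series(["ford pinto", "ford pinto", "amc rebel sst", "saab 99e"])
counts = names.value_counts()
singular_share = (counts == 1).sum() / len(counts)
print(f"{singular_share:.0%} of distinct names occur once")
```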
Exploratory Data Analysis and Hypothesis Generation
t-SNE Preview
# t-SNE analysis to review car name
from sklearn.manifold import TSNE
import plotly.express as px
import warnings
# Future warning suppressed
warnings.filterwarnings('ignore', category=FutureWarning)
# Prepare t-SNE data
features = df.drop('car name', axis=1)
tsne = TSNE(n_components=2, random_state=42, perplexity=min(30, len(features)-1))
features_tsne = tsne.fit_transform(features)
# Create DataFrame
tsne_df = pd.DataFrame(features_tsne, columns=['t-SNE 1', 't-SNE 2'])
tsne_df['car name'] = df['car name']
# Plotly visualization colored by car name
fig = px.scatter(tsne_df, x='t-SNE 1', y='t-SNE 2',
color='car name',
hover_data=['car name'],
title='t-SNE Visualization of car name')
fig.update_layout(
title='t-SNE Visualization of Car Name',
autosize=False,
width=800,
height=300,
xaxis=dict(
title='',
showticklabels=False,
showgrid=False,
zeroline=False
),
yaxis=dict(
title='',
showticklabels=False,
showgrid=False,
zeroline=False
)
)
fig.show()
# t-SNE to review only "Other" in car name
# Identify car names with a frequency of 1
singular_names = df['car name'].value_counts()[df['car name'].value_counts() == 1].index
# Filter the dataset to include only those car names
singular_df = df[df['car name'].isin(singular_names)]
# Drop 'car name' for t-SNE, use other features for singular entries only
features_singular = singular_df.drop('car name', axis=1)
# Apply t-SNE to singular entries
tsne_singular = TSNE(n_components=2, random_state=42, perplexity=min(30, len(features_singular)-1))
features_tsne_singular = tsne_singular.fit_transform(features_singular)
# Create DataFrame for Plotly
tsne_singular_df = pd.DataFrame(features_tsne_singular, columns=['t-SNE 1', 't-SNE 2'])
tsne_singular_df['car name'] = singular_df['car name'].values
import plotly.express as px
# Plotly visualization
fig = px.scatter(tsne_singular_df, x='t-SNE 1', y='t-SNE 2',
color='car name',
hover_data=['car name'], # Hover over points to see car name
title='t-SNE Visualization of Singular Frequency Car Names')
# Update layout for cleaner visualization
fig.update_layout(
autosize=False,
width=800,
height=300,
xaxis=dict(title='',showticklabels=False, showgrid=False, zeroline=False),
yaxis=dict(title='',showticklabels=False, showgrid=False, zeroline=False),
title='t-SNE Visualization of "Other" (Rare) Car Names'
)
fig.show()
Observations:
The data points are widely scattered throughout the t-SNE plots, indicating limited clear structure or clustering patterns; this suggests the need for additional feature refinement or alternative dimensionality reduction techniques to better capture relationships.
- Although t-SNE visualizes clusters by association with vehicle performance features, the car name labels themselves show that:
- The same make is designated inconsistently, e.g. Volkswagen sometimes appears as "vw"
- The car name field can be made more meaningful by extracting its first term (the brand)
- Further grouping of car names may be performed using fuzzy matching as needed
Feature Engineering: Car Brand Review and Clean-up
# Extract the first word from 'car name'
df['car_brand'] = df['car name'].str.split().str[0]
# View the unique brands
print(df['car_brand'].unique())
['chevrolet' 'buick' 'plymouth' 'amc' 'ford' 'pontiac' 'dodge' 'toyota' 'datsun' 'volkswagen' 'peugeot' 'audi' 'saab' 'bmw' 'chevy' 'hi' 'mercury' 'opel' 'fiat' 'oldsmobile' 'chrysler' 'mazda' 'volvo' 'renault' 'toyouta' 'maxda' 'honda' 'subaru' 'chevroelt' 'capri' 'vw' 'mercedes-benz' 'cadillac' 'mercedes' 'vokswagen' 'triumph' 'nissan']
Observation
- Several brand names are misspelled (e.g. "toyouta", "maxda", "chevroelt", "vokswagen"), so fuzzy matching or an explicit mapping may be helpful
# Selective review of extracted car brands
df[df['car_brand'] == 'hi']
| mpg | cylinders | displacement | horsepower | weight | acceleration | model year | car name | index | car_brand | |
|---|---|---|---|---|---|---|---|---|---|---|
| 28 | 9.0 | 8 | 304.0 | 193.0 | 4732 | 18.5 | 70 | hi 1200d | 28 | hi |
Observations
- Internet search results for "1970 1200d" revealed that this is a vintage International Harvester: "Overall, the 1970 International Harvester 1200D is a rare and sought-after vintage pickup truck, prized for its ruggedness, reliability, and nostalgic appeal."
- This confirms that 'car name' could be valuable in the Vintage Classification of the dataset
Decision Point
- "hi" will be renamed "harvester", and the car name will remain as it was entered in the dataset
# Rename "hi" to "harvester"
df.loc[df['car_brand'] == 'hi', 'car_brand'] = 'harvester'
# Confirm update
df[df['car_brand'] == 'harvester']
| mpg | cylinders | displacement | horsepower | weight | acceleration | model year | car name | index | car_brand | |
|---|---|---|---|---|---|---|---|---|---|---|
| 28 | 9.0 | 8 | 304.0 | 193.0 | 4732 | 18.5 | 70 | hi 1200d | 28 | harvester |
Clean Up Car Brand with Mapping
# Cleanup typos with mapping of known inconsistencies
brand_mapping = {
'chevy': 'chevrolet',
'chevroelt': 'chevrolet',
'vw': 'volkswagen',
'vokswagen': 'volkswagen',
'toyouta': 'toyota',
'maxda': 'mazda',
'mercedes': 'mercedes-benz'
}
# Apply the mapping to standardize the brands
df['car_brand'] = df['car_brand'].replace(brand_mapping)
# Selective review of extracted car brands (capri vs ford vs mercury): Capri produced by Mercury, owned by Ford
print(df[df['car_brand'] == 'ford'][['model year', 'car name', 'index', 'car_brand']], "\n")
print(df[df['car_brand'] == 'mercury'][['model year', 'car name', 'index', 'car_brand']], "\n")
print(df[df['car_brand'] == 'capri'][['model year', 'car name', 'index', 'car_brand']], "\n")
model year car name index car_brand
4 70 ford torino 4 ford
5 70 ford galaxie 500 5 ford
17 70 ford maverick 17 ford
25 70 ford f250 25 ford
32 71 ford pinto 32 ford
36 71 ford torino 500 36 ford
40 71 ford galaxie 500 40 ford
43 71 ford country squire (sw) 43 ford
48 71 ford mustang 48 ford
61 72 ford pinto runabout 61 ford
65 72 ford galaxie 500 65 ford
74 72 ford gran torino (sw) 74 ford
80 72 ford pinto (sw) 80 ford
88 73 ford gran torino 88 ford
92 73 ford ltd 92 ford
100 73 ford maverick 100 ford
104 73 ford country 104 ford
112 73 ford pinto 112 ford
126 74 ford maverick 126 ford
130 74 ford pinto 130 ford
136 74 ford gran torino 136 ford
139 74 ford gran torino (sw) 139 ford
155 75 ford maverick 155 ford
159 75 ford ltd 159 ford
166 75 ford mustang ii 166 ford
168 75 ford pinto 168 ford
174 75 ford pinto 174 ford
190 76 ford gran torino 190 ford
193 76 ford maverick 193 ford
200 76 ford granada ghia 200 ford
206 76 ford pinto 206 ford
214 76 ford f108 214 ford
228 77 ford granada 228 ford
232 77 ford thunderbird 232 ford
236 77 ford mustang ii 2+2 236 ford
245 78 ford fiesta 245 ford
254 78 ford fairmont (auto) 254 ford
255 78 ford fairmont (man) 255 ford
264 78 ford futura 264 ford
282 79 ford fairmont 4 282 ford
286 79 ford ltd landau 286 ford
290 79 ford country squire (sw) 290 ford
314 80 ford fairmont 314 ford
336 80 ford mustang cobra 336 ford
351 81 ford escort 4w 351 ford
352 81 ford escort 2h 352 ford
365 81 ford granada gl 365 ford
373 82 ford fairmont futura 373 ford
389 82 ford granada l 389 ford
393 82 ford mustang gl 393 ford
396 82 ford ranger 396 ford
model year car name index car_brand
49 71 mercury capri 2000 49 mercury
67 72 mercury marquis 67 mercury
90 73 mercury marquis brougham 90 mercury
113 73 mercury capri v6 113 mercury
154 75 mercury monarch 154 mercury
224 77 mercury cougar brougham 224 mercury
251 78 mercury monarch ghia 251 mercury
259 78 mercury zephyr 259 mercury
281 79 mercury zephyr 6 281 mercury
287 79 mercury grand marquis 287 mercury
379 82 mercury lynx l 379 mercury
model year car name index car_brand
184 76 capri ii 184 capri
Decision Point
- "capri ii", currently assigned the "capri" car brand, is better classified under the "mercury" brand, consistent with the other Mercury Capri vehicles in the dataset
- Thus, "mercury" will replace its brand, while "capri ii" will remain as its car name
# Rename "capri" car brand to "mercury"
df.loc[df['car_brand'] == 'capri', 'car_brand'] = 'mercury'
# Confirm updated car brand
df[df['car_brand'] == 'mercury']
| mpg | cylinders | displacement | horsepower | weight | acceleration | model year | car name | index | car_brand | |
|---|---|---|---|---|---|---|---|---|---|---|
| 49 | 23.0 | 4 | 122.0 | 86.0 | 2220 | 14.0 | 71 | mercury capri 2000 | 49 | mercury |
| 67 | 11.0 | 8 | 429.0 | 208.0 | 4633 | 11.0 | 72 | mercury marquis | 67 | mercury |
| 90 | 12.0 | 8 | 429.0 | 198.0 | 4952 | 11.5 | 73 | mercury marquis brougham | 90 | mercury |
| 113 | 21.0 | 6 | 155.0 | 107.0 | 2472 | 14.0 | 73 | mercury capri v6 | 113 | mercury |
| 154 | 15.0 | 6 | 250.0 | 72.0 | 3432 | 21.0 | 75 | mercury monarch | 154 | mercury |
| 184 | 25.0 | 4 | 140.0 | 92.0 | 2572 | 14.9 | 76 | capri ii | 184 | mercury |
| 224 | 15.0 | 8 | 302.0 | 130.0 | 4295 | 14.9 | 77 | mercury cougar brougham | 224 | mercury |
| 251 | 20.2 | 8 | 302.0 | 139.0 | 3570 | 12.8 | 78 | mercury monarch ghia | 251 | mercury |
| 259 | 20.8 | 6 | 200.0 | 85.0 | 3070 | 16.7 | 78 | mercury zephyr | 259 | mercury |
| 281 | 19.8 | 6 | 200.0 | 85.0 | 2990 | 18.2 | 79 | mercury zephyr 6 | 281 | mercury |
| 287 | 16.5 | 8 | 351.0 | 138.0 | 3955 | 13.2 | 79 | mercury grand marquis | 287 | mercury |
| 379 | 36.0 | 4 | 98.0 | 70.0 | 2125 | 17.3 | 82 | mercury lynx l | 379 | mercury |
Observation
- The "capri ii" is now categorized under the "mercury" car brand. However, further preprocessing of the car name field to extract explicit models (e.g., "car_model") will enhance subsequent reviews.
# Review unique brands
print(df['car_brand'].unique())
['chevrolet' 'buick' 'plymouth' 'amc' 'ford' 'pontiac' 'dodge' 'toyota' 'datsun' 'volkswagen' 'peugeot' 'audi' 'saab' 'bmw' 'harvester' 'mercury' 'opel' 'fiat' 'oldsmobile' 'chrysler' 'mazda' 'volvo' 'renault' 'honda' 'subaru' 'mercedes-benz' 'cadillac' 'triumph' 'nissan']
Observation
- The list of car brands is now consistent and succinct
- A fuzzy match will help ensure there are no surprises between car_brand and car name
from fuzzywuzzy import fuzz
# Function to calculate fuzzy match score
def calculate_fuzzy_score(row):
return fuzz.ratio(row['car name'], row['car_brand'])
# Create a new DataFrame for testing
df_fuzzy_test = df[['car name', 'car_brand']].copy()
df_fuzzy_test['fuzzy_score'] = df_fuzzy_test.apply(calculate_fuzzy_score, axis=1)
# View potential mismatches
df_fuzzy_test_sorted = df_fuzzy_test.sort_values(by='fuzzy_score', ascending=True)
# Show mismatches for inspection
print(df_fuzzy_test_sorted.head(20)) # Display the lowest scores
     car name                         car_brand   fuzzy_score
28   hi 1200d                         harvester   12
45   amc hornet sportabout (sw)       amc         21
394  vw pickup                        volkswagen  21
293  vw rabbit custom                 volkswagen  23
96   amc ambassador brougham          amc         23
325  vw rabbit c (diesel)             volkswagen  27
184  capri ii                         mercury     27
263  buick regal sport coupe (turbo)  buick       28
43   ford country squire (sw)         ford        29
9    amc ambassador dpl               amc         29
66   amc ambassador sst               amc         29
290  ford country squire (sw)         ford        29
326  vw dasher (diesel)               volkswagen  29
295  dodge colt hatchback custom      dodge       31
140  amc matador (sw)                 amc         32
72   amc matador (sw)                 amc         32
197  vw rabbit                        volkswagen  32
139  ford gran torino (sw)            ford        32
283  amc concord dl 6                 amc         32
309  vw rabbit                        volkswagen  32
Observation
- The fuzzy match surfaces only expected low-similarity pairs (full car names compared against bare brands), so car_brand can be reliably used for visualization and clustering analysis
# Review count of unique car brands
df['car_brand'].value_counts()
car_brand
ford             51
chevrolet        47
plymouth         31
amc              28
dodge            28
toyota           26
datsun           23
volkswagen       22
buick            17
pontiac          16
honda            13
mercury          12
mazda            12
oldsmobile       10
peugeot           8
fiat              8
audi              7
chrysler          6
volvo             6
renault           5
subaru            4
opel              4
saab              4
mercedes-benz     3
bmw               2
cadillac          2
triumph           1
harvester         1
nissan            1
Name: count, dtype: int64
# Review other single-record car brands
print(df[df['car_brand'] == 'nissan'], "\n")
print("---")
print("\n",df[df['car_brand'] == 'triumph'])
mpg cylinders displacement horsepower weight acceleration \
380 36.0 4 120.0 88.0 2160 14.5
model year car name index car_brand
380 82 nissan stanza xe 380 nissan
---
mpg cylinders displacement horsepower weight acceleration \
335 35.0 4 122.0 88.0 2500 15.1
model year car name index car_brand
335 80 triumph tr7 coupe 335 triumph
Findings
While Triumph is a single-record brand, similar to International Harvester, Datsun was Nissan's marque in this era, so the single Nissan record should be reviewed to determine whether it belongs with the 'datsun' brand.
df[df['car_brand'] == 'datsun']
| mpg | cylinders | displacement | horsepower | weight | acceleration | model year | car name | index | car_brand | |
|---|---|---|---|---|---|---|---|---|---|---|
| 18 | 27.0 | 4 | 97.0 | 88.0 | 2130 | 14.5 | 70 | datsun pl510 | 18 | datsun |
| 29 | 27.0 | 4 | 97.0 | 88.0 | 2130 | 14.5 | 71 | datsun pl510 | 29 | datsun |
| 54 | 35.0 | 4 | 72.0 | 69.0 | 1613 | 18.0 | 71 | datsun 1200 | 54 | datsun |
| 81 | 28.0 | 4 | 97.0 | 92.0 | 2288 | 17.0 | 72 | datsun 510 (sw) | 81 | datsun |
| 110 | 22.0 | 4 | 108.0 | 94.0 | 2379 | 16.5 | 73 | datsun 610 | 110 | datsun |
| 129 | 31.0 | 4 | 79.0 | 67.0 | 1950 | 19.0 | 74 | datsun b210 | 129 | datsun |
| 145 | 32.0 | 4 | 83.0 | 61.0 | 2003 | 19.0 | 74 | datsun 710 | 145 | datsun |
| 173 | 24.0 | 4 | 119.0 | 97.0 | 2545 | 17.0 | 75 | datsun 710 | 173 | datsun |
| 204 | 32.0 | 4 | 85.0 | 70.0 | 1990 | 17.0 | 76 | datsun b-210 | 204 | datsun |
| 220 | 33.5 | 4 | 85.0 | 70.0 | 1945 | 16.8 | 77 | datsun f-10 hatchback | 220 | datsun |
| 241 | 22.0 | 6 | 146.0 | 97.0 | 2815 | 14.5 | 77 | datsun 810 | 241 | datsun |
| 247 | 39.4 | 4 | 85.0 | 70.0 | 2070 | 18.6 | 78 | datsun b210 gx | 247 | datsun |
| 268 | 27.2 | 4 | 119.0 | 97.0 | 2300 | 14.7 | 78 | datsun 510 | 268 | datsun |
| 273 | 23.9 | 4 | 119.0 | 97.0 | 2405 | 14.9 | 78 | datsun 200-sx | 273 | datsun |
| 303 | 31.8 | 4 | 85.0 | 65.0 | 2020 | 19.2 | 79 | datsun 210 | 303 | datsun |
| 312 | 37.2 | 4 | 86.0 | 65.0 | 2019 | 16.4 | 80 | datsun 310 | 312 | datsun |
| 320 | 37.0 | 4 | 119.0 | 92.0 | 2434 | 15.0 | 80 | datsun 510 hatchback | 320 | datsun |
| 324 | 40.8 | 4 | 85.0 | 65.0 | 2110 | 19.2 | 80 | datsun 210 | 324 | datsun |
| 333 | 32.7 | 6 | 168.0 | 132.0 | 2910 | 11.4 | 80 | datsun 280-zx | 333 | datsun |
| 347 | 37.0 | 4 | 85.0 | 65.0 | 1975 | 19.4 | 81 | datsun 210 mpg | 347 | datsun |
| 357 | 32.9 | 4 | 119.0 | 100.0 | 2615 | 14.8 | 81 | datsun 200sx | 357 | datsun |
| 362 | 24.2 | 6 | 146.0 | 120.0 | 2930 | 13.8 | 81 | datsun 810 maxima | 362 | datsun |
| 385 | 38.0 | 4 | 91.0 | 67.0 | 1995 | 16.2 | 82 | datsun 310 gx | 385 | datsun |
Decision Point¶
There are no "Stanza" models in the "datsun"-filtered dataset above. According to an internet search, Nissan vehicles were sold under the Datsun brand in the US until 1983, when the Datsun brand was discontinued and the Stanza XE was rebranded as the Nissan Stanza XE. The single "nissan" record will therefore remain unchanged in this analysis.
t-SNE_review¶
from sklearn.preprocessing import LabelEncoder
from sklearn.manifold import TSNE
import plotly.express as px
import warnings
# Suppress selected Future Warning
warnings.filterwarnings(
"ignore",
message="When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas.*",
category=FutureWarning
)
# Encode categorical car brands
le = LabelEncoder()
df['car_brand_encoded'] = le.fit_transform(df['car_brand'])
# Drop unnecessary columns and prepare features
features = df.drop(['car name', 'car_brand'], axis=1)
# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=min(30, len(features)-1))
features_tsne = tsne.fit_transform(features)
# Create a DataFrame for visualization
tsne_df = pd.DataFrame(features_tsne, columns=['t-SNE 1', 't-SNE 2'])
tsne_df['car_brand'] = df['car_brand']
tsne_df['car_brand_encoded'] = df['car_brand_encoded']
# Plot the t-SNE visualization
fig = px.scatter(tsne_df, x='t-SNE 1', y='t-SNE 2',
color='car_brand',
hover_data=['car_brand'])
# Update layout for cleaner visualization
fig.update_layout(
autosize=False,
width=800,
height=300,
xaxis=dict(title='',showticklabels=False, showgrid=False, zeroline=False),
yaxis=dict(title='',showticklabels=False, showgrid=False, zeroline=False),
title='t-SNE Visualization of Car Brands'
)
fig.show()
Observation¶
Clustering by car brand and vintage classification still seems to be obscured by the performance features. It therefore seems reasonable that removing these features will yield more coherent groupings (as they pertain to vintage classification) than applying PCA for dimensionality reduction.
Hypothesis: Vintage Classification¶
- Vintage classification is primarily determined by the interaction between car name and model year.
Null Hypothesis (H₀):
The interaction between car name and model year sufficiently explains vintage classification, without additional transformations (such as PCA) of the remaining performance features.
Implication:
Clustering without PCA (using car name and model year along with other original features) will yield clusters of similar quality as clustering after PCA.
PCA will not significantly improve cluster separability or performance because the most relevant variance is already captured by car name and model year.
Alternative Hypothesis (Hₐ):
PCA will reveal latent features that significantly improve vintage classification beyond the interaction of car name and model year alone.
Implication:
- Clustering after PCA will produce significantly better-defined clusters (e.g., higher silhouette scores, better cluster separability).
- Other features or combinations of features (e.g., displacement, acceleration, or transformed horsepower) contain important latent information relevant to vintage classification.
KMeans without vs with Performance Features vs PCA¶
# Begin with only car brand and model year for 3-cluster scoring using Silhouette Scores
# "Core": without performance features
# "Full": with performance features
# "PCA": with performance features using PCA
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import warnings
# Suppress specific KMeans warning
warnings.filterwarnings("ignore", message="KMeans is known to have a memory leak on Windows with MKL")
# Label encode 'car_brand'
le = LabelEncoder()
df['car_brand_encoded'] = le.fit_transform(df['car_brand'])
# Core
# Drop performance features and keep 'car_brand_encoded'
features_core = df[['model year', 'car_brand_encoded']]
# Fit KMeans
kmeans_core = KMeans(n_clusters=3, random_state=42, n_init=10)
labels_core = kmeans_core.fit_predict(features_core)
# Evaluate cluster quality
silhouette_core = silhouette_score(features_core, labels_core)
print(f'Silhouette Score (Core Features - Null Hypothesis): {silhouette_core}')
# Full: add non-performance metrics (weight, displacement)
features_full = df[['model year', 'car_brand_encoded', 'weight', 'displacement']]
# Fit KMeans
kmeans_full = KMeans(n_clusters=3, random_state=42, n_init=10)
labels_full = kmeans_full.fit_predict(features_full)
# Evaluate cluster quality
silhouette_full = silhouette_score(features_full, labels_full)
print(f'Silhouette Score (Full Features with Non-Performance Metrics): {silhouette_full}')
# Compare alternative hypothesis with PCA + KMeans
from sklearn.decomposition import PCA
# Apply PCA to features before running KMeans
pca = PCA(n_components=2) # Reduce to 2 dimensions
# features_pca = pca.fit_transform(features_core)  # core-only PCA scored ~0.4035
features_pca = pca.fit_transform(features_full)
# Fit KMeans on PCA-reduced features
kmeans_pca = KMeans(n_clusters=3, random_state=42, n_init=10)
labels_pca = kmeans_pca.fit_predict(features_pca)
# Evaluate cluster quality
silhouette_pca = silhouette_score(features_pca, labels_pca)
print(f'Silhouette Score (PCA Features - Alternative Hypothesis): {silhouette_pca}')
Silhouette Score (Core Features - Null Hypothesis): 0.40354677446266135
Silhouette Score (Full Features with Non-Performance Metrics): 0.5871997958450826
Silhouette Score (PCA Features - Alternative Hypothesis): 0.587706584336502
Silhouette Score¶
Evaluation of the quality of clustering as represented by the Silhouette Score:
- Cluster Validity: A higher score indicates that clusters are well-formed and distinct.
- Optimal Number of Clusters: The silhouette score can help reveal and identify the ideal number of clusters.
- Cluster Interpretability: Assess whether the clustering reflects meaningful patterns in the data (e.g., does it align with vintage classification?).
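As a quick sanity check on how to read these scores, here is a minimal sketch on synthetic data (not the auto-mpg set): compact, well-separated blobs score near 1, while heavily overlapping blobs score much lower.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def toy_silhouette(spread, seed=42):
    """Silhouette of KMeans(k=2) on two 2-D Gaussian blobs with the given spread."""
    rng = np.random.default_rng(seed)
    X = np.vstack([rng.normal(0, spread, (50, 2)),
                   rng.normal(5, spread, (50, 2))])
    labels = KMeans(n_clusters=2, random_state=seed, n_init=10).fit_predict(X)
    return silhouette_score(X, labels)

print(toy_silhouette(0.1))  # well-separated blobs -> close to 1
print(toy_silhouette(3.0))  # overlapping blobs -> much lower
```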
Findings
- Core Features (Null Hypothesis):
- Silhouette Score: 0.4035
- This reflects a moderate clustering quality, suggesting that while model year and car brand alone can form clusters, they lack the nuance to capture fully distinct vintage classifications.
- Full Features (Non-Performance Metrics):
- Silhouette Score: 0.5872
- A substantial improvement, showing that adding weight and displacement leads to more cohesive and well-separated clusters.
- These features seem to carry valuable information about vehicle design or era-related trends, which help refine vintage classification.
- PCA on Full Features (Alternative Hypothesis):
- Silhouette Score: 0.5877
- Almost identical to the Full Features model without PCA.
- This suggests that PCA successfully reduces dimensionality while preserving important patterns, but it doesn't improve cluster separability significantly beyond the raw features.
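One hedged way to see why unscaled PCA tracks the raw features so closely is to inspect the explained variance ratio: on unscaled inputs, the large-variance column (weight) dominates the first component. The sketch below uses a synthetic stand-in with roughly auto-mpg-like column scales, not the actual dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the full feature set
# (model year, car_brand_encoded, weight, displacement) -- illustrative scales only.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(70, 83, 200),       # model year: small variance
    rng.integers(0, 30, 200),        # car_brand_encoded: small variance
    rng.normal(3000, 800, 200),      # weight: dominant variance
    rng.normal(200, 100, 200),       # displacement: moderate variance
])
ratio = PCA(n_components=2).fit(X).explained_variance_ratio_
print(ratio)  # the first component is essentially the weight axis
```

Because nearly all of the total variance already lies along the weight/displacement directions, projecting to two components discards almost nothing, which is consistent with the near-identical silhouette scores above.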
Next Step¶
- Compare results with scaling
# Scaling all 3 models to compare
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Initialize StandardScaler
scaler = StandardScaler()
# Scale Core Features
features_core_scaled = scaler.fit_transform(features_core)
# Scale Full Features
features_full_scaled = scaler.fit_transform(features_full)
# Step 1: PCA on Scaled Full Features
pca = PCA(n_components=3)
features_pca_scaled = pca.fit_transform(features_full_scaled)
# Step 2: Fit KMeans and Evaluate Silhouette Scores
print('Scaled Silhouette Scoring:')
# Core Features Model
kmeans_core = KMeans(n_clusters=3, random_state=42, n_init=10)
labels_core = kmeans_core.fit_predict(features_core_scaled)
silhouette_core = silhouette_score(features_core_scaled, labels_core)
print(f'Silhouette Score (Core Features - Scaled): {silhouette_core}')
# Full Features Model
kmeans_full = KMeans(n_clusters=3, random_state=42, n_init=10)
labels_full = kmeans_full.fit_predict(features_full_scaled)
silhouette_full = silhouette_score(features_full_scaled, labels_full)
print(f'Silhouette Score (Full Features - Scaled): {silhouette_full}')
# PCA Features Model (note: k=2 here, unlike the k=3 runs above; both k values are compared systematically later)
kmeans_pca = KMeans(n_clusters=2, random_state=42, n_init=10)
labels_pca = kmeans_pca.fit_predict(features_pca_scaled)
silhouette_pca = silhouette_score(features_pca_scaled, labels_pca)
print(f'Silhouette Score (PCA Features - Scaled): {silhouette_pca}')
Scaled Silhouette Scoring:
Silhouette Score (Core Features - Scaled): 0.42145239754250424
Silhouette Score (Full Features - Scaled): 0.33382797545291965
Silhouette Score (PCA Features - Scaled): 0.3916454191001744
Finding¶
- Core Features show the highest silhouette score compared to Full Features, with and without PCA
# Visualize the 3 models together
import plotly.express as px
from sklearn.manifold import TSNE
import pandas as pd
import warnings
# Suppress the FutureWarning from Plotly
warnings.filterwarnings("ignore", message="When grouping with a length-1 list-like")
# Decode 'car_brand_encoded' back to original 'car_brand'
df['car_brand'] = le.inverse_transform(df['car_brand_encoded'])
# t-SNE for Core Features
tsne_core = TSNE(n_components=2, random_state=42, perplexity=30)
features_tsne_core = tsne_core.fit_transform(features_core)
tsne_df_core = pd.DataFrame(features_tsne_core, columns=['t-SNE 1', 't-SNE 2'])
tsne_df_core['car brand'] = df.loc[features_core.index, 'car_brand']
tsne_df_core['cluster'] = labels_core
tsne_df_core['model'] = 'Core Features'
tsne_df_core['index'] = features_core.index
tsne_df_core['car name'] = df.loc[features_core.index, 'car name']
tsne_df_core['model year'] = df.loc[features_core.index, 'model year']
# t-SNE for Full Features
tsne_full = TSNE(n_components=2, random_state=42, perplexity=30)
features_tsne_full = tsne_full.fit_transform(features_full)
tsne_df_full = pd.DataFrame(features_tsne_full, columns=['t-SNE 1', 't-SNE 2'])
tsne_df_full['car brand'] = df.loc[features_full.index, 'car_brand']
tsne_df_full['cluster'] = labels_full
tsne_df_full['model'] = 'Full Features'
tsne_df_full['index'] = features_full.index
tsne_df_full['car name'] = df.loc[features_full.index, 'car name']
tsne_df_full['model year'] = df.loc[features_full.index, 'model year']
# t-SNE for PCA Features
tsne_pca = TSNE(n_components=2, random_state=42, perplexity=30)
features_tsne_pca = tsne_pca.fit_transform(features_pca)
tsne_df_pca = pd.DataFrame(features_tsne_pca, columns=['t-SNE 1', 't-SNE 2'])
tsne_df_pca['car brand'] = df.loc[features_full.index, 'car_brand']
tsne_df_pca['cluster'] = labels_pca
tsne_df_pca['model'] = 'PCA Features'
tsne_df_pca['index'] = features_full.index
tsne_df_pca['car name'] = df.loc[features_full.index, 'car name']
tsne_df_pca['model year'] = df.loc[features_full.index, 'model year']
# Combine t-SNE DataFrames
tsne_combined = pd.concat([tsne_df_core, tsne_df_full, tsne_df_pca])
# Plot with facet by model and custom hover
fig = px.scatter(tsne_combined,
x='t-SNE 1',
y='t-SNE 2',
color='cluster',
facet_col='model',
title='t-SNE Comparison of Core vs Full vs Full with PCA')
# Customize hovertemplate
fig.update_traces(
hovertemplate='<br>Car Brand: %{customdata[0]}<br>Car Name: %{customdata[1]}<br>Model Year: %{customdata[2]}<br>Cluster: %{customdata[3]}<extra></extra>',
customdata=tsne_combined[['car brand', 'car name', 'model year', 'cluster']].to_numpy()
)
# Suppress t-SNE axis labels for all facets
fig.update_xaxes(title='', showticklabels=False, showgrid=False, zeroline=False, matches='x')
fig.update_yaxes(title='', showticklabels=False, showgrid=False, zeroline=False, matches='y')
# Control dimensions
fig.update_layout(
autosize=False,
width=800,
height=400,
coloraxis_showscale=False
)
fig.show()
General Observations¶
- The PCA and Full Features (without PCA) models look very much alike; their shapes also resemble the original t-SNE preview and review visualizations
- The Core Features model shows clear, distinct clusters with apparent separation between them; this seems to suggest meaningful differences between data points
- An evaluation of optimal number of clusters should be examined next
Elbow Plots¶
# Review optimal clusters with Elbow Plot
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Define a function to create the elbow plot
def plot_elbow(features, title):
inertia_values = []
cluster_range = range(1, 11) # Test 1 to 10 clusters
for k in cluster_range:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(features)
inertia_values.append(kmeans.inertia_)
# Plot the elbow plot
plt.figure(figsize=(4, 3))
plt.plot(cluster_range, inertia_values, marker='o', linestyle='--')
plt.title(f'{title}', fontsize=10)
plt.xlabel('Number of Clusters (k)', fontsize=8)
plt.ylabel('Inertia (WCSS)', fontsize=8)
plt.yticks(fontsize=8)
plt.grid(True)
plt.show()
# Elbow Plot for Core Features (Only Car Brand and Model Year)
plot_elbow(features_core, 'Core Features (Car Brand + Model Year)')
# Elbow Plot for Full Features (Model Year + Car Brand + Non-Performance Metrics)
plot_elbow(features_full, 'Full Features')
# Elbow Plot for PCA Features (Dimensionality Reduced Features)
plot_elbow(features_pca, 'PCA Features (Full Features Reduced to 2D)')
Findings¶
- All 3 elbow plots agree that k = 2 is the optimal number of clusters
- Re-run KMeans where k = 2 for all 3 models
- Also re-run as k=2 to compare scaling for all 3 models
Clustering iterations with 2 Clusters¶
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import warnings
# Suppress specific KMeans warning
warnings.filterwarnings("ignore", message="KMeans is known to have a memory leak on Windows with MKL")
def preprocess_data(df):
"""Encodes categorical variables and prepares data for clustering."""
# Label encode 'car_brand'
le = LabelEncoder()
df['car_brand_encoded'] = le.fit_transform(df['car_brand'])
return df
def get_features(df):
"""Define the feature sets for clustering."""
# Core features: minimal set
features_core = df[['model year', 'car_brand_encoded']]
# Full features: include additional metrics
features_full = df[['model year', 'car_brand_encoded', 'weight', 'displacement']]
return features_core, features_full
def calculate_silhouette_scores(features_core, features_full):
"""Calculates silhouette scores for different feature sets, scaling, and cluster counts."""
results = []
scaler = StandardScaler()
# Core Features: Scaled and Unscaled
features_core_scaled = scaler.fit_transform(features_core)
for scaling, features in zip(["Unscaled", "Scaled"], [features_core, features_core_scaled]):
for k in [2, 3]:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
labels = kmeans.fit_predict(features)
silhouette = silhouette_score(features, labels)
results.append({
"Feature Set": "Core Features",
"Scaling": scaling,
"Clusters": k,
"Silhouette Score": silhouette
})
# Full Features: Scaled and Unscaled
features_full_scaled = scaler.fit_transform(features_full)
for scaling, features in zip(["Unscaled", "Scaled"], [features_full, features_full_scaled]):
for k in [2, 3]:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
labels = kmeans.fit_predict(features)
silhouette = silhouette_score(features, labels)
results.append({
"Feature Set": "Full Features",
"Scaling": scaling,
"Clusters": k,
"Silhouette Score": silhouette
})
# PCA Features: Scaled and Unscaled
for scaling, features in zip(["Unscaled", "Scaled"], [features_full, features_full_scaled]):
pca = PCA(n_components=2)
features_pca = pca.fit_transform(features)
for k in [2, 3]:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
labels = kmeans.fit_predict(features_pca)
silhouette = silhouette_score(features_pca, labels)
results.append({
"Feature Set": "PCA Features",
"Scaling": scaling,
"Clusters": k,
"Silhouette Score": silhouette
})
return pd.DataFrame(results)
def print_results(df_silhouette):
"""Prints silhouette scores in the desired format."""
for scaling in ["Unscaled", "Scaled"]:
print(f"2-Cluster, {scaling} Silhouette Scoring:")
filtered = df_silhouette[(df_silhouette["Clusters"] == 2) & (df_silhouette["Scaling"] == scaling)]
for _, row in filtered.iterrows():
print(f"Silhouette Score ({row['Feature Set']} - {scaling}): {row['Silhouette Score']}")
print()
def main(df):
# Step 1: Preprocess Data
df = preprocess_data(df)
# Step 2: Get Feature Sets
features_core, features_full = get_features(df)
# Step 3: Calculate Silhouette Scores
df_silhouette = calculate_silhouette_scores(features_core, features_full)
# Step 4: Print Results
print_results(df_silhouette)
# Step 5: Return DataFrame for further analysis
return df_silhouette
# Run the models: Full vs Core, Scaled vs Unscaled, Produce corresponding silhouette scores
df_silhouette = main(df)
2-Cluster, Unscaled Silhouette Scoring:
Silhouette Score (Core Features - Unscaled): 0.5565510187158499
Silhouette Score (Full Features - Unscaled): 0.6355437404585835
Silhouette Score (PCA Features - Unscaled): 0.6357939678511504

2-Cluster, Scaled Silhouette Scoring:
Silhouette Score (Core Features - Scaled): 0.39066681853629537
Silhouette Score (Full Features - Scaled): 0.3851821168576777
Silhouette Score (PCA Features - Scaled): 0.4796259476891685
Statistical Benchmarking and Cross-Context Validation¶
- While silhouette scores are meaningful for comparisons within their immediate contexts (2-cluster vs 3-cluster, scaled vs unscaled), statistical comparisons of the clustering models allow more meaningful comparison across contexts
- Coefficient of Variation (CV) and Relative Range can be applied as key metrics to provide a global measure of clustering performance and stability
- Although the elbow plots of all three models unanimously identified 2 clusters as optimal, the elbow point, like the silhouette score, is not an absolute metric
- The elbow plot provides guidance rather than definitive answers
- The results from the elbow plots must therefore be interpreted in conjunction with other metrics, such as stability and reliability metrics, as well as domain knowledge
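Before applying these metrics to the silhouette results, the two definitions can be illustrated on a hypothetical handful of scores (CV uses the sample standard deviation, matching the `ddof=1` in the cells below):

```python
import numpy as np

scores = [0.40, 0.42, 0.41, 0.39]   # hypothetical silhouette scores

# Coefficient of Variation: sample std as a percentage of the mean
cv = np.std(scores, ddof=1) / np.mean(scores) * 100

# Relative Range: spread as a percentage of the best (highest) score
rel_range = (max(scores) - min(scores)) / max(scores) * 100

print(round(cv, 2), round(rel_range, 2))  # ~3.19 and ~7.14
```

Lower values of either metric mean the scores barely move across contexts, i.e. a more stable configuration.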
Stability and Reliability Metrics¶
# Coefficient of Variation: Quantify relative variability across model scenarios
# CV by Feature Set and Scaling (Aggregate Clusters)
cv_results_feature_scaling = df_silhouette.groupby(['Feature Set', 'Scaling'])['Silhouette Score'].apply(
lambda x: (np.std(x, ddof=1) / np.mean(x)) * 100
)
print("\nCoefficient of Variation (CV) by Feature Set and Scaling:")
print(cv_results_feature_scaling)
# CV by Feature Set and Clusters (Aggregate Scaling)
cv_results_feature_clusters = df_silhouette.groupby(['Feature Set', 'Clusters'])['Silhouette Score'].apply(
lambda x: (np.std(x, ddof=1) / np.mean(x)) * 100
)
print("\nCoefficient of Variation (CV) by Feature Set and Clusters:")
print(cv_results_feature_clusters)
# CV by Scaling and Clusters (Aggregate Feature Set)
cv_results_scaling_clusters = df_silhouette.groupby(['Scaling', 'Clusters'])['Silhouette Score'].apply(
lambda x: (np.std(x, ddof=1) / np.mean(x)) * 100
)
print("\nCoefficient of Variation (CV) by Scaling and Clusters:")
print(cv_results_scaling_clusters)
Coefficient of Variation (CV) by Feature Set and Scaling:
Feature Set Scaling
Core Features Scaled 5.360960
Unscaled 22.537358
Full Features Scaled 10.100793
Unscaled 5.591415
PCA Features Scaled 9.242436
Unscaled 5.558300
Name: Silhouette Score, dtype: float64
Coefficient of Variation (CV) by Feature Set and Clusters:
Feature Set Clusters
Core Features 2 24.766815
3 3.069382
Full Features 2 34.687551
3 38.904567
PCA Features 2 19.800160
3 23.408187
Name: Silhouette Score, dtype: float64
Coefficient of Variation (CV) by Scaling and Clusters:
Scaling Clusters
Scaled 2 12.668079
3 12.855771
Unscaled 2 7.496989
3 20.180273
Name: Silhouette Score, dtype: float64
Findings: Coefficient of Variation: Stability / Reliability¶
- The CV analysis demonstrates that Full Features with 3 Clusters is a highly variable model and therefore less reliable for drawing robust insights.
- The Core Features with 3 Clusters has the lowest CV (3.07%):
- This indicates that the silhouette scores for this configuration are highly consistent across different contexts.
- It also suggests that the clustering results are stable, making this model more reliable for drawing conclusions.
# Range and Relative Range by Feature Set and Scaling
range_feature_scaling = df_silhouette.groupby(['Feature Set', 'Scaling'])['Silhouette Score'].agg(
Range=lambda x: x.max() - x.min(),
Relative_Range=lambda x: ((x.max() - x.min()) / x.max()) * 100 if x.max() > 0 else np.nan
)
print("\nRange and Relative Range by Feature Set and Scaling:")
print(range_feature_scaling)
# Range and Relative Range by Feature Set and Clusters
range_feature_clusters = df_silhouette.groupby(['Feature Set', 'Clusters'])['Silhouette Score'].agg(
Range=lambda x: x.max() - x.min(),
Relative_Range=lambda x: ((x.max() - x.min()) / x.max()) * 100 if x.max() > 0 else np.nan
)
print("\nRange and Relative Range by Feature Set and Clusters:")
print(range_feature_clusters)
# Range and Relative Range by Scaling and Clusters
range_scaling_clusters = df_silhouette.groupby(['Scaling', 'Clusters'])['Silhouette Score'].agg(
Range=lambda x: x.max() - x.min(),
Relative_Range=lambda x: ((x.max() - x.min()) / x.max()) * 100 if x.max() > 0 else np.nan
)
print("\nRange and Relative Range by Scaling and Clusters:")
print(range_scaling_clusters)
Range and Relative Range by Feature Set and Scaling:
Range Relative_Range
Feature Set Scaling
Core Features Scaled 0.030786 7.304640
Unscaled 0.153004 27.491504
Full Features Scaled 0.051354 13.332431
Unscaled 0.048344 7.606706
PCA Features Scaled 0.058845 12.268954
Unscaled 0.048087 7.563360
Range and Relative Range by Feature Set and Clusters:
Range Relative_Range
Feature Set Clusters
Core Features 2 0.165884 29.805749
3 0.017906 4.248552
Full Features 2 0.250362 39.393295
3 0.253372 43.149167
PCA Features 2 0.156168 24.562677
3 0.166926 28.402902
Range and Relative Range by Scaling and Clusters:
Range Relative_Range
Scaling Clusters
Scaled 2 0.094444 19.691143
3 0.087624 20.791060
Unscaled 2 0.079243 12.463621
3 0.184160 31.335332
Findings: Range and Relative Range¶
- By Scaling: Core Features (Scaled) has the lowest relative range among the scaled sets (7.30), indicating high stability in scaled data; Core Features (Unscaled) has the lowest relative range among the unscaled sets (27.49), showing greater stability in raw form than the other unscaled feature sets.
- By Clusters: Core Features with 3 Clusters (4.25) has the lowest relative range overall and is the most stable cluster configuration.
- By Scaling and Clusters: Unscaled with 2 Clusters has the lowest relative range (12.46), suggesting that unscaled features are more stable with fewer clusters.
Taken together, Core Features with 3 Clusters is the most stable and reliable configuration overall.
Visualization of Findings¶
import seaborn as sns
import matplotlib.pyplot as plt
# Add Adjusted Silhouette Score (1 - x)
df_silhouette['Adjusted Score'] = 1 - df_silhouette['Silhouette Score']
# Create the figure and subplots
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Subplot 1: Bar Chart
sns.barplot(ax=axes[0], x='Feature Set', y='Adjusted Score', data=df_silhouette, errorbar=None)  # use ci=None on seaborn < 0.12
axes[0].set_title('Average Reliability/Stability Score by Feature Set')
axes[0].set_ylabel('Adjusted Silhouette Score (1 - Silhouette Score)')
axes[0].set_xlabel('Feature Set')
# Subplot 2: Heatmap
heatmap_data = df_silhouette.pivot_table(
index='Feature Set', columns='Clusters', values='Adjusted Score', aggfunc='mean'
)
sns.heatmap(heatmap_data, annot=True, fmt='.2f', cmap='coolwarm', ax=axes[1])
axes[1].set_title('Reliability/Stability Measure: Feature Set x Clusters')
axes[1].set_xlabel('Number of Clusters')
axes[1].set_ylabel('Feature Set')
# Adjust layout for better spacing
plt.tight_layout()
# Show the combined figure
plt.show()
Findings: Magnitude of variability relative to the highest silhouette score in each context¶
- Core Features with 3 Clusters consistently shows the lowest Relative Range across all groupings, further validating its stability and reliability.
- Full Features with 3 Clusters consistently has the highest Relative Range, indicating significant instability.
Hypothesis Testing Results, Conclusion & Recommendations¶
Null Hypothesis (H₀):
- The interaction between car name and model year sufficiently explains vintage classification without PCA.
Supporting Evidence:
- Core Features with 3 Clusters consistently shows the lowest variability of any configuration: CV 3.07% and Relative Range 4.25%.
- The low silhouette-score variability indicates that adding dimensions (Full Feature Set) or reducing dimensionality with PCA introduces noise rather than clarity.
Alternative Hypothesis (Hₐ):
- PCA reveals latent features that significantly improve vintage classification.
Counter Evidence:
- PCA Features with 2 Clusters shows lower relative variability than most configurations, but the added complexity does not translate into better performance.
- No post-PCA configuration significantly outperforms the Core Features clustering for vintage scoring.
Conclusion & Recommendations
Support for Null Hypothesis (H₀):
- Clustering based on car name and model year is sufficient for stable and reliable vintage classification.
- Core Features with 3 Clusters demonstrates the lowest variability and the most consistent clustering behavior.
1. Conclusion: The Core Features with 3 Clusters (see Cluster Diagram and Data Table below) can be leveraged to extract valuable insights about the cars, grouping them to more effectively target the audience. Further analysis with domain experts will be necessary to achieve this.
2. Recommendations:
- Start with the simpler Core Features, avoiding unnecessary transformations.
- PCA may introduce latent dimensions, but for this dataset, performance improvement was marginal with greater variability in cluster quality.
- Domain expert(s) should review the 3-Cluster report to apply labels onto the dataset; these labels then can be further leveraged in supervised learning (see below).
- In collaboration with domain expert(s):
- Further feature engineer the dataset, e.g. extract the vehicle model from 'car name' to enhance clustering by car_brand, year, and car_model; performance features (e.g. horsepower) should be reviewed and validated as a reliable basis for classifying "Vintage" status (see below).
- Further iterations in unsupervised learning may be pursued, exploring the use of multi-model ensemble learning.
- Supervised learning can be used in lieu of unsupervised learning if Vintage Classification relies heavily on domain expert labelling.
- Use of Vintage Classification can be further discussed as it pertains to binary (e.g. Vintage vs Non-Vintage) versus multiple classes (e.g. Classic vs Retro vs Modern vs Emerging Vintage).
- Vintage Classification can be explored as a continuous score using a regression-based approach to combine weighted factors.
- The model can be leveraged to target audience segmentation: Group potential buyers by their affinity for specific vintage categories.
- The model can be leveraged for marketing insights: Identify which features drive perceptions of vintage value.
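As a sketch of the regression-style recommendation above, a continuous vintage score could combine weighted, normalized factors. The weights, column choices, and the "older and lower-powered is more vintage" assumption below are illustrative placeholders for domain experts to refine, not fitted values.

```python
import pandas as pd

def vintage_score(df: pd.DataFrame, w_age: float = 0.7, w_power: float = 0.3) -> pd.Series:
    """Continuous vintage score in [0, 1]: older and lower-powered -> more vintage.
    Weights are hypothetical, not fitted."""
    minmax = lambda s: (s - s.min()) / (s.max() - s.min())
    age = df["model year"].max() - df["model year"]   # years before the newest model
    return w_age * minmax(age) + w_power * (1 - minmax(df["horsepower"]))

# Toy sample, not the auto-mpg data
cars = pd.DataFrame({"model year": [70, 76, 82],
                     "horsepower": [165.0, 100.0, 65.0]})
scores = vintage_score(cars)
print(scores.tolist())  # highest score for the oldest car in this toy sample
```

A score like this could then feed the segmentation and marketing use cases listed above, or serve as a regression target once domain experts supply labels.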